{
 "cells": [
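  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dropout\n",
    "\n",
    "In addition to the weight decay introduced in the previous section, deep learning models often use dropout [1] to combat overfitting. Dropout has several variants; in this section, the term refers specifically to inverted dropout.\n",
    "\n",
    "## Method\n",
    "\n",
    "Recall that Figure 3.3 in the section [“Multilayer Perceptron”](mlp.ipynb) depicts a multilayer perceptron with a single hidden layer. It has 4 inputs and 5 hidden units, and hidden unit $h_i$ ($i=1, \\ldots, 5$) is computed as\n",
    "\n",
    "$$h_i = \\phi\\left(x_1 w_{1i} + x_2 w_{2i} + x_3 w_{3i} + x_4 w_{4i} + b_i\\right),$$\n",
    "\n",
    "Here $\\phi$ is the activation function, $x_1, \\ldots, x_4$ are the inputs, $w_{1i}, \\ldots, w_{4i}$ are the weights of hidden unit $i$, and $b_i$ is its bias. When dropout is applied to this hidden layer, each hidden unit is dropped with a certain probability. Let the drop probability be $p$: with probability $p$, $h_i$ is set to zero, and with probability $1-p$, $h_i$ is rescaled by dividing it by $1-p$. The drop probability is a hyperparameter of dropout. Concretely, let the random variable $\\xi_i$ take the values 0 and 1 with probabilities $p$ and $1-p$, respectively. With dropout, we compute the new hidden unit $h_i'$ as\n",
    "\n",
    "$$h_i' = \\frac{\\xi_i}{1-p} h_i.$$\n",
    "\n",
    "Since $E(\\xi_i) = 1-p$, we have\n",
    "\n",
    "$$E(h_i') = \\frac{E(\\xi_i)}{1-p}h_i = h_i.$$\n",
    "\n",
    "That is, dropout does not change the expected value of its input. Let us apply dropout to the hidden layer in Figure 3.3. One possible outcome is shown in Figure 3.5, where $h_2$ and $h_5$ are set to zero. The output then no longer depends on $h_2$ and $h_5$, and during backpropagation the gradients of the weights associated with these two hidden units are all 0. Because hidden units are dropped at random during training, any of $h_1, \\ldots, h_5$ may be set to zero, so the output layer cannot rely too heavily on any single one of them. This acts as regularization during training and can help combat overfitting. When testing the model, we generally do not use dropout, so that the results are more deterministic.\n",
    "\n",
    "![A multilayer perceptron with dropout applied to the hidden layer](../img/dropout.svg)\n",
    "\n",
    "## Implementation from Scratch\n",
    "\n",
    "Following the definition of dropout, we can implement it easily. The `dropout` function below drops the elements of the `NDArray` input `X` with probability `drop_prob`."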
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import d2lzh as d2l\n",
    "from mxnet import autograd, gluon, init, nd\n",
    "from mxnet.gluon import loss as gloss, nn\n",
    "\n",
    "def dropout(X, drop_prob):\n",
    "    assert 0 <= drop_prob <= 1\n",
    "    keep_prob = 1 - drop_prob\n",
    "    # In this case, all elements are dropped\n",
    "    if keep_prob == 0:\n",
    "        return X.zeros_like()\n",
    "    # Keep each element independently with probability keep_prob\n",
    "    mask = nd.random.uniform(0, 1, X.shape) < keep_prob\n",
    "    # Rescale the kept elements so that the expected value is unchanged\n",
    "    return mask * X / keep_prob"
   ]
  },
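  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sanity check of the `dropout` function above against the definition of $\\xi_i$ earlier in this section, the following sketch assumes that `nd.random.uniform(0, 1, ...)` draws each sample independently and uniformly from $[0, 1)$. The symbols $U$ (one such sample) and $x$ (the corresponding element of `X`) are introduced here only for illustration, and $p$ stands for `drop_prob`. A mask entry is 0 exactly when $U \\ge 1-p$, so it plays the role of $\\xi_i$:\n",
    "\n",
    "$$P(\\xi_i = 0) = P(U \\ge 1-p) = p, \\qquad P(\\xi_i = 1) = 1-p.$$\n",
    "\n",
    "Under this assumption, each output element equals $x/(1-p)$ with probability $1-p$ and 0 with probability $p$, so that\n",
    "\n",
    "$$E\\left(\\frac{\\xi_i}{1-p}x\\right) = x, \\qquad \\mathrm{Var}\\left(\\frac{\\xi_i}{1-p}x\\right) = \\frac{p(1-p)}{(1-p)^2}x^2 = \\frac{p}{1-p}x^2.$$\n",
    "\n",
    "For example, with $p = 0.5$ every kept element is doubled, and with $p = 0$ the input is returned unchanged."
   ]
  },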
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us run a few examples to test the `dropout` function, with drop probabilities of 0, 0.5, and 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[[ 0.  1.  2.  3.  4.  5.  6.  7.]\n",
       " [ 8.  9. 10. 11. 12. 13. 14. 15.]]\n",
       "<NDArray 2x8 @cpu(0)>"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = nd.arange(16).reshape((2, 8))\n",
    "dropout(X, 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[[ 0.  2.  4.  6.  0.  0.  0. 14.]\n",
       " [ 0. 18.  0.  0. 24. 26. 28.  0.]]\n",
       "<NDArray 2x8 @cpu(0)>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dropout(X, 0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[[0. 0. 0. 0. 0. 0. 0. 0.]\n",
       " [0. 0. 0. 0. 0. 0. 0. 0.]]\n",
       "<NDArray 2x8 @cpu(0)>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dropout(X, 1)"
   ]
  },
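  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Defining Model Parameters\n",
    "\n",
    "In this experiment, we again use the Fashion-MNIST dataset introduced in the section [“Softmax Regression Implementation from Scratch”](softmax-regression-scratch.ipynb). We define a multilayer perceptron with two hidden layers, each of which has 256 outputs."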
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256\n",
    "\n",
    "W1 = nd.random.normal(scale=0.01, shape=(num_inputs, num_hiddens1))\n",
    "b1 = nd.zeros(num_hiddens1)\n",
    "W2 = nd.random.normal(scale=0.01, shape=(num_hiddens1, num_hiddens2))\n",
    "b2 = nd.zeros(num_hiddens2)\n",
    "W3 = nd.random.normal(scale=0.01, shape=(num_hiddens2, num_outputs))\n",
    "b3 = nd.zeros(num_outputs)\n",
    "\n",
    "params = [W1, b1, W2, b2, W3, b3]\n",
    "for param in params:\n",
    "    param.attach_grad()"
   ]
  },
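  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Defining the Model\n",
    "\n",
    "The model defined below chains fully connected layers with the ReLU activation function and applies dropout to the output of each activation function. The drop probability can be set separately for each layer. A common recommendation is to use a smaller drop probability for layers closer to the input. In this experiment, we set the drop probability of the first hidden layer to 0.2 and that of the second hidden layer to 0.5. We can use the `is_training` function introduced in the section [“Automatic Gradient Computation”](../chapter_prerequisite/autograd.ipynb) to determine whether the code is running in training or testing mode, and apply dropout only in training mode."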
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "drop_prob1, drop_prob2 = 0.2, 0.5\n",
    "\n",
    "def net(X):\n",
    "    X = X.reshape((-1, num_inputs))\n",
    "    H1 = (nd.dot(X, W1) + b1).relu()\n",
    "    if autograd.is_training():  # Use dropout only when training the model\n",
    "        H1 = dropout(H1, drop_prob1)  # Add a dropout layer after the first fully connected layer\n",
    "    H2 = (nd.dot(H1, W2) + b2).relu()\n",
    "    if autograd.is_training():\n",
    "        H2 = dropout(H2, drop_prob2)  # Add a dropout layer after the second fully connected layer\n",
    "    return nd.dot(H2, W3) + b3"
   ]
  },
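  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training and Testing the Model\n",
    "\n",
    "This part is similar to the training and testing of the multilayer perceptron in the earlier sections."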
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 1, loss 1.1105, train acc 0.573, test acc 0.763\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 2, loss 0.5730, train acc 0.786, test acc 0.841\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 3, loss 0.4901, train acc 0.821, test acc 0.854\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 4, loss 0.4452, train acc 0.838, test acc 0.845\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 5, loss 0.4171, train acc 0.848, test acc 0.859\n"
     ]
    }
   ],
   "source": [
    "num_epochs, lr, batch_size = 5, 0.5, 256\n",
    "loss = gloss.SoftmaxCrossEntropyLoss()\n",
    "train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)\n",
    "d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,\n",
    "              params, lr)"
   ]
  },
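  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concise Implementation\n",
    "\n",
    "In Gluon, we only need to add a `Dropout` layer after each fully connected layer and specify the drop probability. When training the model, the `Dropout` layer randomly drops the output elements of the previous layer with the specified drop probability; when testing the model, the `Dropout` layer has no effect."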
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "net = nn.Sequential()\n",
    "net.add(nn.Dense(256, activation=\"relu\"),\n",
    "        nn.Dropout(drop_prob1),  # Add a dropout layer after the first fully connected layer\n",
    "        nn.Dense(256, activation=\"relu\"),\n",
    "        nn.Dropout(drop_prob2),  # Add a dropout layer after the second fully connected layer\n",
    "        nn.Dense(10))\n",
    "net.initialize(init.Normal(sigma=0.01))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we train and test the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 1, loss 1.1730, train acc 0.547, test acc 0.756\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 2, loss 0.5752, train acc 0.788, test acc 0.836\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 3, loss 0.4932, train acc 0.820, test acc 0.833\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 4, loss 0.4444, train acc 0.839, test acc 0.835\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 5, loss 0.4246, train acc 0.846, test acc 0.861\n"
     ]
    }
   ],
   "source": [
    "trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})\n",
    "d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None,\n",
    "              None, trainer)"
   ]
  },
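  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "* We can use dropout to combat overfitting.\n",
    "* Dropout is used only when training the model.\n",
    "\n",
    "## Exercises\n",
    "\n",
    "* What happens if the two drop probability hyperparameters in this section are swapped?\n",
    "* Increase the number of epochs and compare the results with and without dropout.\n",
    "* If the model is made more complex, for example by adding hidden units, does dropout become more effective at combating overfitting?\n",
    "* Using the model in this section as an example, compare the effects of dropout and weight decay. What happens if dropout and weight decay are used together?\n",
    "\n",
    "## References\n",
    "\n",
    "[1] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.\n",
    "\n",
    "## Scan the QR Code to Reach the [Discussion Forum](https://discuss.gluon.ai/t/topic/1278)\n",
    "\n",
    "![](../img/qr_dropout.svg)"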
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}